12 research outputs found

    Talus: A simple way to remove cliffs in cache performance

    Caches often suffer from performance cliffs: minor changes in program behavior or available cache space cause large changes in miss rate. Cliffs hurt performance and complicate cache management. We present Talus, a simple scheme that removes these cliffs. Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications. By controlling the sizes of these partitions, Talus ensures that as an application is given more cache space, its miss rate decreases in a convex fashion. We prove that Talus removes performance cliffs, and evaluate it through extensive simulation. Talus adds negligible overheads, improves single-application performance, simplifies partitioning algorithms, and makes cache partitioning more effective and fair.
    National Science Foundation (U.S.) (Grant CCF-1318384)
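    The convexity guarantee can be illustrated with a small sketch (mine, not the paper's hardware design): given an application's measured miss curve, Talus-style interpolation places the target cache size on the curve's lower convex hull by splitting the access stream between two bracketing partition sizes. The function names and the discrete miss-curve arrays below are assumptions for illustration.

```python
def lower_convex_hull(sizes, misses):
    """Indices of the lower convex hull of a miss curve (miss ratio vs. size)."""
    hull = []
    for i in range(len(sizes)):
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            cross = (sizes[a] - sizes[o]) * (misses[i] - misses[o]) \
                  - (misses[a] - misses[o]) * (sizes[i] - sizes[o])
            if cross <= 0:          # point `a` lies on or above segment o->i: drop it
                hull.pop()
            else:
                break
        hull.append(i)
    return hull

def talus_split(sizes, misses, target):
    """Split the access stream between two hull points bracketing `target`
    so the combined miss ratio lands on the convex hull."""
    hull = lower_convex_hull(sizes, misses)
    for lo, hi in zip(hull, hull[1:]):
        if sizes[lo] <= target <= sizes[hi]:
            rho = (sizes[hi] - target) / (sizes[hi] - sizes[lo])
            return {
                "part1": (rho, rho * sizes[lo]),            # (access fraction, size)
                "part2": (1 - rho, (1 - rho) * sizes[hi]),
                "miss_ratio": rho * misses[lo] + (1 - rho) * misses[hi],
            }
    raise ValueError("target size outside the measured miss curve")

# Example: a miss curve with a cliff between 8 and 16 units of capacity.
sizes  = [0, 4, 8, 16, 32]
misses = [1.0, 0.8, 0.75, 0.2, 0.18]
print(talus_split(sizes, misses, target=12))
```

    The two returned partition sizes always sum to the target, and the combined miss ratio is the linear interpolation between the hull points, which is what makes the resulting miss curve convex.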

    Jigsaw: Scalable software-defined caches

    Shared last-level caches, widely used in chip-multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.
    United States. Defense Advanced Research Projects Agency (DARPA PERFECT contract HR0011-13-2-0005); Quanta Computer (Firm)

    Distributed naming in a fos

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 87-89).
    A factored operating system (fos) is a new operating system design for manycore and cloud computers. In fos, OS services are separated from application code and run on distinct cores. Furthermore, each service is split into a fleet, or parallel set of cooperating processes that communicate using messages. Applications discover OS services through a distributed, dynamic name service. Each core runs a thin microkernel, and applications link in a user-space library called libfos that translates service requests into messages. The name service facilitates message delivery by looking up service locations and load balancing within service fleets. libfos caches service locations in a private cache to accelerate message delivery, and invalid entries are detected and invalidated by the microkernel. As messaging is the primary communication medium in fos, the name service plays a foundational role in the system. It enables key concepts of fos's design, such as fleets, communication locality, elasticity, and spatial scheduling. It is also one of the first complex services implemented in fos, and its implementation provides insight into issues one encounters while developing a distributed fos service. This thesis describes the design and implementation of the naming system in fos, including the naming and messaging system within each application and the distributed name service itself. Scaling numbers for the name service are presented for various workloads, as well as end-to-end performance numbers for two benchmarks. These numbers indicate good scaling of the name service with expected usage patterns, and superior messaging performance of the new naming system when compared with its prior implementation. The thesis concludes with research directions for future work.
    by Nathan Beckmann. S.M.
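    As a rough illustration of the lookup path described above, here is a toy sketch (class and method names are mine, not the fos API): a libfos-like stub keeps a private name cache, falls back to a name service that load-balances across a fleet on a miss, and re-resolves entries that have gone stale.

```python
import random

class Endpoint:
    """A reachable member of a service fleet."""
    def __init__(self, label):
        self.label = label
        self.alive = True

    def deliver(self, message):
        return f"{self.label} handled {message!r}"

class NameService:
    """Stand-in for the distributed name service: service name -> fleet members."""
    def __init__(self):
        self.fleets = {}

    def register(self, name, endpoint):
        self.fleets.setdefault(name, []).append(endpoint)

    def lookup(self, name):
        return random.choice(self.fleets[name])      # load-balance within the fleet

class LibFosStub:
    """Translates service requests into messages, caching name translations
    privately so repeated requests skip the name service."""
    def __init__(self, name_service):
        self.ns = name_service
        self.cache = {}

    def request(self, name, message):
        endpoint = self.cache.get(name)
        if endpoint is None or not endpoint.alive:   # miss or stale entry
            endpoint = self.ns.lookup(name)          # re-resolve and refill the cache
            self.cache[name] = endpoint
        return endpoint.deliver(message)

# Example: a two-member file-system fleet.
ns = NameService()
ns.register("/sys/fs", Endpoint("fs-server-0"))
ns.register("/sys/fs", Endpoint("fs-server-1"))
app = LibFosStub(ns)
print(app.request("/sys/fs", "open /etc/motd"))
```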

    Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling

    Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.
    National Science Foundation (U.S.) (Grant CCF-1318384); Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Jacobs Presidential Fellowship); United States. Defense Advanced Research Projects Agency (PERFECT Contract HR0011-13-2-0005)
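    To make the capacity/latency tradeoff concrete, below is a toy greedy co-placement sketch (my own simplification, not CDCS's monitoring hardware or reconfiguration algorithm): threads with larger cache allocations are placed first, each on the core whose neighborhood has the most free bank capacity, and each then claims capacity from the nearest banks.

```python
import itertools

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def place_threads_and_data(allocs, mesh_dim):
    """Toy greedy joint placement on a mesh_dim x mesh_dim tiled CMP.
    allocs: thread id -> cache capacity needed (in bank-sized units)."""
    tiles = list(itertools.product(range(mesh_dim), repeat=2))
    free_cores = set(tiles)
    bank_free = {t: 1.0 for t in tiles}     # each tile contributes one bank of capacity
    placement = {}

    for tid, need in sorted(allocs.items(), key=lambda kv: -kv[1]):
        # Place the thread where nearby free capacity is most plentiful.
        def nearby_capacity(core):
            return sum(bank_free[b] / (1 + manhattan(core, b)) for b in tiles)
        core = max(free_cores, key=nearby_capacity)
        free_cores.remove(core)

        # Claim the thread's allocation from the closest banks with free space.
        claimed = []
        for bank in sorted(tiles, key=lambda b: manhattan(core, b)):
            if need <= 0:
                break
            take = min(need, bank_free[bank])
            if take > 0:
                bank_free[bank] -= take
                claimed.append((bank, round(take, 3)))
                need -= take
        placement[tid] = {"core": core, "banks": claimed}
    return placement

# Example: three threads with different working-set sizes on a 4x4 mesh.
print(place_threads_and_data({"A": 5.0, "B": 2.0, "C": 0.5}, mesh_dim=4))
```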

    Rubik: fast analytical power management for latency-critical systems

    Latency-critical workloads (e.g., web search), common in datacenters, require stable tail (e.g., 95th percentile) latencies of a few milliseconds. Servers running these workloads are kept lightly loaded to meet these stringent latency targets. This low utilization wastes billions of dollars in energy and equipment annually. Applying dynamic power management to latency-critical workloads is challenging. The fundamental issue is coping with their inherent short-term variability: requests arrive at unpredictable times and have variable lengths. Without knowledge of the future, prior techniques either adapt slowly and conservatively or rely on application-specific heuristics to maintain tail latency. We propose Rubik, a fine-grain DVFS scheme for latency-critical workloads. Rubik copes with variability through a novel, general, and efficient statistical performance model. This model allows Rubik to adjust frequencies at sub-millisecond granularity to save power while meeting the target tail latency. Rubik saves up to 66% of core power, widely outperforms prior techniques, and requires no application-specific tuning. Beyond saving core power, Rubik robustly adapts to sudden changes in load and system performance. We use this capability to design RubikColoc, a colocation scheme that uses Rubik to allow batch and latency-critical work to share hardware resources more aggressively than prior techniques. RubikColoc reduces datacenter power by up to 31% while using 41% fewer servers than a datacenter that segregates latency-critical and batch work, and achieves 100% core utilization.
    National Science Foundation (U.S.) (Grant CCF-1318384)
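    The decision made at each frequency-adjustment point can be sketched as follows (a toy stand-in: the percentile-of-queued-work predictor below is my assumption, not Rubik's statistical model): pick the lowest frequency whose predicted tail latency still meets the target.

```python
import numpy as np

def make_toy_predictor(queued_cycles_samples):
    """Stand-in performance model: tail latency (ms) is the chosen percentile of
    pending work, in cycles, divided by the core frequency. Rubik's real model is
    analytical and accounts for request arrivals and service-time variability."""
    samples = np.asarray(queued_cycles_samples, dtype=float)
    def predict(freq_hz, percentile):
        return np.percentile(samples, percentile) / freq_hz * 1e3
    return predict

def pick_frequency(freqs_hz, predict, target_ms, percentile=95):
    """Lowest frequency whose predicted tail latency meets the target."""
    for f in sorted(freqs_hz):
        if predict(f, percentile) <= target_ms:
            return f
    return max(freqs_hz)                    # nothing meets the target: run flat out

# Example: recent queue snapshots (cycles of pending work) and four DVFS steps.
predict = make_toy_predictor([2e6, 4e6, 9e6, 1.5e7])
print(pick_frequency([2.0e9, 2.4e9, 2.8e9, 3.0e9], predict, target_ms=5.0))
```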

    Design and analysis of spatially-partitioned shared caches

    Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. Cataloged from PDF version of thesis. Includes bibliographical references (pages 167-176).
    Data movement is a growing problem in modern chip-multiprocessors (CMPs). Processors spend the majority of their time, energy, and area moving data, not processing it. For example, a single main memory access takes hundreds of cycles and costs the energy of a thousand floating-point operations. Data movement consumes more than half the energy in current processors, and CMPs devote more than half their area to on-chip caches. Moreover, these costs are increasing as CMPs scale to larger core counts. Processors rely on the on-chip caches to limit data movement, but CMP cache design is challenging. For efficiency reasons, most cache capacity is shared among cores and distributed in banks throughout the chip. Distribution makes cores sensitive to data placement, since some cache banks can be accessed at lower latency and lower energy than others. Yet because applications require sufficient capacity to fit their working sets, it is not enough to just use the closest cache banks. Meanwhile, cores compete for scarce capacity, and the resulting interference, left unchecked, produces many unnecessary cache misses. This thesis presents novel architectural techniques that navigate these complex tradeoffs and reduce data movement. First, virtual caches spatially partition the shared cache banks to fit applications' working sets near where they are used. Virtual caches expose the distributed banks to software, and let the operating system schedule threads and their working sets to minimize data movement. Second, analytical replacement policies make better use of scarce cache capacity, reducing expensive main memory accesses: Talus eliminates performance cliffs by guaranteeing convex performance, and EVA uses planning theory to derive the optimal replacement metric under uncertainty. These policies improve performance and make qualitative contributions: Talus is cheap to predict, and so lets cache partitioning techniques (including virtual caches) work with high-performance cache replacement; and EVA shows that the conventional approach to practical cache replacement is sub-optimal. Designing CMP caches is difficult because architects face many options with many interacting factors. Unlike most prior caching work that employs best-effort heuristics, we reason about the tradeoffs through analytical models. This analytical approach lets us achieve the performance and efficiency of application-specific designs across a broad range of applications, while further providing a coherent theoretical framework to reason about data movement. Compared to a 64-core CMP with a conventional cache design, these techniques improve end-to-end performance by up to 76% and an average of 46%, save 36% of system energy and reduce cache area by 10%, while adding small area, energy, and runtime overheads.
    by Nathan Beckmann. Ph. D.
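    To give a flavor of the replacement-policy work summarized above, here is a rough, EVA-flavored ranking sketch: it values a cached line by its expected future hits minus the opportunity cost of the cache space-time it would occupy. This is my approximation of the intuition only, not the thesis's derivation; the input distributions, the cost accounting, and the function name are illustrative assumptions.

```python
import numpy as np

def eva_style_ranks(hit_pdf, evict_pdf):
    """Approximate EVA-style rank for each age: expected future hits minus the
    opportunity cost of the space-time a line of that age will consume.
    The candidate whose age has the lowest rank is evicted first."""
    hit_pdf = np.asarray(hit_pdf, dtype=float)
    evict_pdf = np.asarray(evict_pdf, dtype=float)
    event_pdf = hit_pdf + evict_pdf                  # a line leaves the cache at this age
    max_age = len(event_pdf)

    # Opportunity cost: hits earned per unit of line occupancy, cache-wide.
    total_hits = hit_pdf.sum()
    expected_lifetime = (np.arange(1, max_age + 1) * event_pdf).sum()
    cost_rate = total_hits / expected_lifetime

    ranks = np.full(max_age, -np.inf)
    for a in range(max_age):
        p_reach = event_pdf[a:].sum()                # probability a line survives to age a
        if p_reach <= 0:
            continue
        p_hit = hit_pdf[a:].sum() / p_reach          # Pr[hit | reached age a]
        exp_remaining = ((np.arange(a, max_age) - a + 1) * event_pdf[a:]).sum() / p_reach
        ranks[a] = p_hit - cost_rate * exp_remaining
    return ranks

# Example: most hits come early; long-lived lines are mostly evictions.
hits   = [0.05, 0.30, 0.20, 0.05, 0.02, 0.01]
evicts = [0.00, 0.00, 0.02, 0.05, 0.10, 0.20]
print(np.round(eva_style_ranks(hits, evicts), 3))
```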

    Modeling cache performance beyond LRU

    Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
    National Science Foundation (U.S.) (Grant CCF-1318384); Qatar Computing Research Institute
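    As a concrete illustration of what modeling replacement as an abstract ranking function over ages means, here is a toy simulator (not the paper's analytical model; the function names and the synthetic trace are mine): a fully associative cache evicts the candidate whose age has the lowest rank, so swapping the ranking function swaps the policy.

```python
import random

def simulate_age_policy(trace, cache_lines, rank):
    """Fully associative cache with an age-based replacement policy.
    rank(age) scores each cached line; on a miss, the lowest-ranked line is evicted.
    rank = lambda age: -age reproduces LRU; rank = lambda age: age gives MRU."""
    age = {}                                 # cached address -> accesses since last use
    hits = 0
    for addr in trace:
        for a in age:                        # every cached line ages by one access
            age[a] += 1
        if addr in age:
            hits += 1
        elif len(age) >= cache_lines:
            victim = min(age, key=lambda a: rank(age[a]))
            del age[victim]
        age[addr] = 0                        # (re)insert at age 0
    return hits / len(trace)

# Example: a skewed synthetic trace; compare an LRU-like and an MRU-like ranking.
random.seed(0)
addrs = list(range(512))
weights = [1.0 / (i + 1) for i in addrs]
trace = random.choices(addrs, weights, k=50_000)
print("LRU-like:", simulate_age_policy(trace, 64, rank=lambda age: -age))
print("MRU-like:", simulate_age_policy(trace, 64, rank=lambda age: age))
```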

    Jenga: Software-Defined Cache Hierarchies

    Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications. We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.
    National Science Foundation (U.S.) (Grant CCF-1318384); National Science Foundation (U.S.) (CAREER Grant 1452994)
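    The choice a runtime faces here can be sketched with a brute-force toy (my simplification, assuming a per-application miss curve and a handful of candidate cache resources with uniform latencies; this is not Jenga's actual algorithm): compare single-level and two-level virtual hierarchies by average access latency and keep the cheapest.

```python
def average_latency(miss_curve, levels, mem_latency):
    """Average access latency of a virtual hierarchy given as a list of
    (capacity, latency) levels; miss_curve(size) is the miss ratio at that size."""
    latency, reach, total = 0.0, 1.0, 0
    for capacity, lat in levels:
        total += capacity
        latency += reach * lat              # every access reaching this level pays lat
        reach = miss_curve(total)           # fraction that still misses and goes deeper
    return latency + reach * mem_latency

def best_hierarchy(miss_curve, resources, mem_latency):
    """Brute-force over one- and two-level hierarchies built from the resources."""
    options = [[r] for r in resources]
    options += [[resources[i], resources[j]]
                for i in range(len(resources)) for j in range(len(resources))
                if i != j and resources[i][1] < resources[j][1]]   # faster level first
    return min(options, key=lambda lv: average_latency(miss_curve, lv, mem_latency))

# Example: local SRAM slice, remote SRAM region, stacked-DRAM cache (MB, cycles).
resources = [(4, 12), (32, 30), (512, 90)]
miss_curve = lambda mb: max(0.02, 0.6 / (1 + mb))    # toy convex miss curve
print(best_hierarchy(miss_curve, resources, mem_latency=240))
```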

    ATAC: Improving performance and programmability

    Given the current trends in multicore scaling, chips with 1000 cores may exist within the next 5 to 10 years. However, their promise of increased performance will only be reached if their inherent scaling and programming challenges are overcome. Meanwhile, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a reality: an interconnect technology that can provide more bandwidth at lower power than conventional electronics. Perhaps more importantly, optical interconnect also has the potential to enable new, easy-to-use programming models built on its inexpensive broadcast mechanism. This paper introduces ATAC, a new manycore architecture that capitalizes on the recent advances in optics to address a number of challenges that future manycore designs will face. The new constraints and opportunities of on-chip optical interconnect are presented and explored in the design of ATAC. Furthermore, this paper discusses ATAC's programming models, and introduces Consumer Tagging, a novel programming model that leverages ATAC's strengths to provide high performance and scalability.

    An operating system for multicore and clouds: Mechanisms and implementation

    Cloud computers and multicore processors are two emerging classes of computational hardware that have the potential to provide unprecedented compute capacity to the average user. In order for the user to effectively harness all of this computational power, operating systems (OSes) for these new hardware platforms are needed. Existing multicore operating systems do not scale to large numbers of cores, and do not support clouds. Consequently, current day cloud systems push much complexity onto the user, requiring the user to manage individual Virtual Machines (VMs) and deal with many system-level concerns. In this work we describe the mechanisms and implementation of a factored operating system named fos. fos is a single system image operating system across both multicore and Infrastructure as a Service (IaaS) cloud systems. fos tackles OS scalability challenges by factoring the OS into its component system services. Each system service is further factored into a collection of Internet-inspired servers which communicate via messaging. Although designed in a manner similar to distributed Internet services, OS services instead provide traditional kernel services such as file systems, scheduling, memory management, and access to hardware. fos also implements new classes of OS services like fault tolerance and demand elasticity. In this work, we describe our working fos implementation, and provide early performance measurements of fos for both intra-machine and inter-machine operations.
    Quanta Computer (Firm); United States. Defense Advanced Research Projects Agency; United States. Air Force Research Laboratory; Google (Firm)
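    As a loose illustration of the fleet idea described above (names and the use of threads are my shortcuts; fos servers are separate processes on dedicated cores communicating purely by messages), the sketch below shows a service fleet draining a shared mailbox of request messages and growing elastically when requests back up.

```python
import queue
import threading

class Fleet:
    """Toy fos-style service fleet: workers (threads here, for brevity) drain a
    shared mailbox of request messages; the fleet grows when the mailbox backs up."""
    def __init__(self, handler, min_workers=1, max_workers=4, backlog_limit=8):
        self.mailbox = queue.Queue()
        self.handler = handler
        self.max_workers = max_workers
        self.backlog_limit = backlog_limit
        self.workers = []
        for _ in range(min_workers):
            self._spawn()

    def _spawn(self):
        worker = threading.Thread(target=self._serve, daemon=True)
        worker.start()
        self.workers.append(worker)

    def _serve(self):
        while True:
            reply_to, request = self.mailbox.get()   # receive a request message
            reply_to.put(self.handler(request))      # send the reply message back

    def send(self, request):
        if self.mailbox.qsize() > self.backlog_limit and len(self.workers) < self.max_workers:
            self._spawn()                            # demand elasticity
        reply = queue.Queue()
        self.mailbox.put((reply, request))
        return reply.get()                           # block until a fleet member replies

# Example: a trivial "file system" fleet that echoes requests.
fs = Fleet(handler=lambda req: f"served {req!r}")
print(fs.send("open /etc/motd"))
```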